How to manipulate a gene list

The golden rule before we start Always keep a backup of the output from the pipeline, never work on the original data!

There are various tasks you might wish to perform on a gene list

  • Search for our favourite gene
  • Sort / Rank according to statistic or p-value
  • Filter to obtain genes with particular cut-off and fold-change
  • Re-order / re-arrange columns

Our recommended tools to perform such operations would be a programming language such as R / Python so that the operations can be scripted and automated. As a compromise, our Galaxy server provides various operations through an intuitive interface and allows the user to build a workflow to a chain of simple operations.

Different online tools or GUIs will take different types of input, but will probably involve some combination of the operations mentioned above. Therefore, we will introduce some tools in Galaxy that should allow you to manipulate your data into the required form. Later, we will give examples of workflows for particular gene set enrichment / pathways analysis tools.

In the first generic example, we will show how to filter a gene list using Galaxy:-

How to apply an alternative filter to the gene list in Galaxy

Upload the gene list in csv form

The gene list .csv file can be uploaded into Galaxy. In this example we want to choose the file with no p-value cut-offs applied

Convert to tabular

Before we can go further, Galaxy needs to convert the data you just uploaded into tabular form by replacing commas with a tab. This can be done using the Text Manipulation -> Convert delimiters to TAB option. Make sure that Commas is selected from the drop-down.

Filter

We can apply filtering using the menu option Filter and Sort -> Filter data on any column using simple expressions. Here we use the condition c7 <0.01 and c3 > 1.5 to require that Column 7 (adjusted p-value) is less than 0.05 and Column 3 (log fold-change) is greater than 1.5.

You could also use this menu to filter on genes on a particular chromosome or within a certain range of start and end positions

Selecting columns

Sorting by a column

We can also sort the table by values in a particular column. e.g. the log\(_2\) fold change.

  • can also sort by multiple columns
    • e.g. Chromosome ## R workflow

For those keen on R, equivalent operations can be performed with the dplyr package. See our intermediate R course for details.

library(dplyr)
deTable <- read.csv("t47d_Treatment_DEA_Prog-vs-Control_all.csv")
filteredTable <- filter(deTable, padj < 0.05, log2FoldChange > 1.5)
sortedTable <- arrange(filteredTable, log2FoldChange)
reducedTable <- select(filteredTable, X,baseMean,log2FoldChange)

Gene-Ontologies and Pathways

Gene-Ontology Analysis

In the early days of microarray analysis, people were happy if they got a handful of differentially-expressed genes that they could validate or follow-up. However, with later technologies (and depending on the experimental setup) we might have thousands of statistically-significant results, which no-one has the time to follow-up. Also, we might be interested in pathways / mechanisms that are altered and not just individual genes.

In this section we move towards discovering if our results are biologically significant. Are the genes that we have picked statistical flukes, or are there some commonalities.

There are two different approaches one might use, and we will cover the theory behind both

Theory Part I: Over-representation analysis

  • “Threshold-based”: require defintion of a statistical threshold to define list of genes to test (e.g. FDR < 0.01)
  • Hypergeometric test or Fisher’s Exact test generally used.

The question we are asking here is;

“Are the number of DE genes associated with Theme X significantly greater than what we might expect by chance alone?”

Where Theme X could be genes belonging to a particular GO (Gene Onotology) term or pathway.

Let’s imagine that we have a bag full of balls. Each balls represents a gene in the gene universe. - Paint the balls representing our selected list grey, and paint the rest red.

In this small example, we can know in advance the total number of balls and total number in our category of interest

  • Total number of balls: 40
  • Total number of interesting (grey) balls: 10

Now, lets select a number (say, 12) of the balls at random without seeing into the bag and look at what we get

We can express a particular choice of balls as a table:-

In Selection Not In Selection Total
Grey Selected 8 2 10
Grey Not Selected 4 26 30
Total 12 6 40

The formula for Fishers exact test is;

\[ p = \frac{\binom{a + b}{a}\binom{c +d}{c}}{\binom{n}{a +c}} = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!n!} \] with :-

In Selection Not In Selection Total
Grey Selected a b a + b
Grey Not Selected c d c + d
Total a + c b +d a + b + c + d (=n)

or less formally;

P = (ways of choosing grey balls) X (ways of non-grey balls amongst subset) / ways of choosing subset

Fortunately, we have software to calculate these quantities!

Software for conducting a over-representation test (goseq)

We will be using goseq, which is a software package available through Bioconductor. However, rather having to write R code, we will be using the package through our institute’s Galaxy server.

This package has been specifically-developed for use with RNA-seq data. Plenty of methods have been applied to microarray data, but the assumptions might not hold for RNA-seq. For instance, goseq will adjust for the length of a gene.

Preparing the data for an over-representation test

We can start with the list that contains results for all genes; Let’s take t47d_Treatment_DEA_Prog-vs-Control_all.csv

deTable <- read.csv("t47d_Treatment_DEA_Prog-vs-Control_all.csv")
deTable

The goal is to obtain a table with two columns; the first containing the gene identifier, and the second being TRUE or FALSE indicating whether the gene is differentially-expressed at a given cut-off (e.g. 0.05).

There are several ways of doing this. If you wish to see the R code (using dplyr), click the ‘Code’ button on the right.

library(dplyr)
deTable <- read.csv("t47d_Treatment_DEA_Prog-vs-Control_all.csv")
mutate(deTable, DE = padj < 0.05) %>% 
  mutate(DE = ifelse(is.na(DE),FALSE,DE)) %>% 
  select(X, DE) -> filteredTable
  write.table(filteredTable,"de-table-for-goseq.txt",quote=FALSE,row.names=FALSE)
filteredTable

Manipulating the file can also be done in Galaxy.

Importing the Gene list into Galaxy

Import the csv file

Get Data -> Upload Data

Convert to tabular form

Text Manipulation -> Convert delimiters to TAB

Select Commas from drop-down menu

Remove header line

Text Manipulation -> Select last lines from a dataset

Make note of how many lines in file, N and remove N-1 last lines with this tool

Retain columns of interest

Put c1,c2,c3,c4,c5,c6,c7,c12 to retain columns 1,2,3,4,5,6,7 and 12

Text Manipulation -> Cut columns from a table

The cleaned dataset

The output from the previous step will be referred to as our cleaned dataset

Performing a Gene Set Analysis in Galaxy

From our cleaned table in the previous steps, we need to compute if each gene is DE at 0.05 significance level

Adding an extra column

Text Manipulation -> Compute an expression on every row

Use the expression c7<0.05 to test if the adjusted p-value is less than 0.05

Which should give something like this

Extracting the columns needed for goseq

Text Manipulation -> Cut columns from a table Choose c1,c9 as the columns to cut

Which should give the following

Running goseq

*NGS:RNA-seq -> goseq

Select Genome version hg19 in this case. Could also specify which type of identifiers can be found in the first column. Ensembl ID is correct in this case.

Exercise:

The differentially-expressed genes that we identified could either by up- or down-regulated. Sometimes we might want to look for over-represented genes in the up- or down-regulated genes only

  • Let’s suppose we want to analyse the up-regulated genes only
  • Consider what changes you would need to make to the workflow and perform the analysis

Theory Part II: Threshold-free

For these tests, we don’t have to specify a statistical threshold and use the test statistics from all genes as the input to the test. The popular Gene Set Enrichment Analysis (GSEA) uses this approach. These tests can be used to detect differential expression for a group of genes, even when the effects are too small or there is too little data to detect the genes individually.

Subramanian et al, PNAS 2005

Subramanian et al, PNAS 2005

  • Rank all your genes according to test statistic or fold-change
  • For a particular gene set of interest, look down the gene list and compute a score
    • Go up one unit on the y-axis if the gene you encounter is contained in your gene list
    • Go down one unit if the gene is not contained in the gene list
    • Plot the score against position in gene list
    • If there is a peak at the left or right, then your set is enriched.

Let’s say that we have 10 genes “of interest” and can look-up their statistics from the table

The Broad institute provides a version of GSEA that can be run via a java application. Generically, the input to GSEA is matrix of expression values / counts and a file describing which covariate each sample belongs to. However, whether or not the method can be applied directly to RNA-seq data is under debate. According to the GSEA faq

The GSEA team has yet to determine whether any of these ranking statistics, originally selected for their effectiveness when used with expression data derived from DNA Microarray experiments, are appropriate for use with expression data derived from RNA-seq experiments. We hopefully will be able to devote some time to investigating this, but in the mean time, we are recommending use of the GSEAPreranked tool for conducting gene set enrichment analysis of data derived from RNA-seq experiments.

The GSEAPreranked tool mentioned is available through the Broad insitute’s GenePattern analysis suite. However, we don’t want to introduce yet another analysis suite, so we will focus on how to perform this analysis in Galaxy.

Preparing the gene set for a GSEA analysis

library(dplyr)
deTable <- read.csv("deseq2_results/t47d_Treatment_DEA_Prog-vs-Control_all.csv")
deTable %>% arrange(desc(stat)) %>% 
  dplyr:::select(symbol, stat) %>% 
  filter(!is.na(symbol)) -> orderedTable
orderedTable

Ranking the genes

From the cleaned table created above, sort on column 5 (the test statistic)

Filter and Sort -> Sort data in ascending or descending order

Giving the output:-

Cutting the columns

Extract columns 8 (gene symbol) followed by test statistic; c8,c5

Text Manipulation -> Cut columns from a table

Running a GSEA analysis

Using the fGSEA tool

fGSEA is a Bioconductor package that implements the pre-ranked GSEA analysis. It also happens to be a faster implementation, as described in their pre-print

GSEA uses a set of pre-defined gene sets in their analysis which come under the following categories:-

  • H hallmark gene sets
  • C1 positional gene sets
  • C2 curated gene sets
  • C3 motif gene sets
  • C4 computational gene sets
  • C5 GO gene sets
  • C6 oncogenic signatures
  • C7 immunologic signatures

Gives all results ranked according to..

  • Can we identify gene sets that are significant at 0.05?
  • What genes are over- and under-represented?
  • Can you run the tool on another dataset?
---
title: "Beyond the Gene List"
author: "Mark Dunning; mark 'dot' dunning 'at' cruk.cam.ac.uk"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
  html_notebook:
    toc: yes
    toc_float: yes
  html_document:
    toc: yes
---

```{r setup, include=FALSE,echo=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

# How to manipulate a gene list

The golden rule before we start ***Always keep a backup of the output from the pipeline, never work on the original data***!

There are various tasks you might wish to perform on a gene list

- Search for our favourite gene
- Sort / Rank according to statistic or p-value
- Filter to obtain genes with particular cut-off and fold-change
- Re-order / re-arrange columns

Our recommended tools to perform such operations would be a programming language such as R / Python so that the operations can be scripted and automated. As a compromise, our Galaxy server provides various operations through an intuitive interface and allows the user to build a workflow to a chain of simple operations.

Different online tools or GUIs will take different types of input, but will probably involve some combination of the operations mentioned above. Therefore, we will introduce some tools in Galaxy that should allow you to manipulate your data into the required form. Later, we will give examples of workflows for particular gene set enrichment / pathways analysis tools.

In the first generic example, we will show how to filter a gene list using Galaxy:-

## How to apply an alternative filter to the gene list in Galaxy

### Upload the gene list in csv form

The gene list `.csv` file can be uploaded into Galaxy. In this example we want to choose the file with no p-value cut-offs applied

![](images/filter1.png)

### Convert to tabular

Before we can go further, Galaxy needs to convert the data you just uploaded into *tabular* form by replacing commas with a tab. This can be done using the ***Text Manipulation*** -> ***Convert delimiters to TAB*** option. Make sure that *Commas* is selected from the drop-down.

![](images/filter2.png)

### Filter

We can apply filtering using the menu option ***Filter and Sort*** -> ***Filter data on any column using simple expressions***. Here we use the condition `c7 <0.01 and c3 > 1.5` to require that Column 7 (adjusted p-value) is less than 0.05 *and* Column 3 (log fold-change) is greater than 1.5.


![](images/filter3.png)

You could also use this menu to filter on genes on a particular chromosome or within a certain range of start and end positions

## Selecting columns

![](images/filter4.png)

## Sorting by a column

We can also sort the table by values in a particular column. e.g. the log$_2$ fold change.
![](images/filter5.png)

- can also sort by multiple columns
    + e.g. Chromosome 
## R workflow

For those keen on R, equivalent operations can be performed with the `dplyr` package. See our [intermediate R](http://bioinformatics-core-shared-training.github.io/r-intermediate/) course for details. 

```{r message=FALSE}
library(dplyr)
deTable <- read.csv("t47d_Treatment_DEA_Prog-vs-Control_all.csv")
filteredTable <- filter(deTable, padj < 0.05, log2FoldChange > 1.5)
sortedTable <- arrange(filteredTable, log2FoldChange)
reducedTable <- select(filteredTable, X,baseMean,log2FoldChange)
```

# Gene-Ontologies and Pathways

## Gene-Ontology Analysis

In the early days of microarray analysis, people were happy if they got a handful of differentially-expressed genes that they could validate or follow-up. However, with later technologies (and depending on the experimental setup) we might have thousands of statistically-significant results, which no-one has the time to follow-up. Also, we might be interested in pathways / mechanisms that are altered and not just individual genes.

In this section we move towards discovering if our results are ***biologically significant***. Are the genes that we have picked statistical flukes, or are there some commonalities. 

There are two different approaches one might use, and we will cover the theory behind both


## Theory Part I: Over-representation analysis

- "Threshold-based": require defintion of a statistical threshold to define list of genes to test (e.g. FDR < 0.01)
- Hypergeometric test or Fisher's Exact test generally used.

The question we are asking here is;

> ***"Are the number of DE genes associated with Theme X significantly greater than what we might expect by chance alone?"***

Where Theme X could be genes belonging to a particular GO (Gene Onotology) term or pathway.

Let's imagine that we have a bag full of balls. Each balls represents a gene in the *gene universe*. 
- Paint the balls representing our selected list grey, and paint the rest red.


![](images/bag-and-balls.png)

In this small example, we can know in advance the total number of balls and total number in our category of interest

- Total number of balls: 40
- Total number of interesting (grey) balls: 10

Now, lets select a number (say, 12) of the balls at random without seeing into the bag and look at what we get

![](images/picked-balls.png)



We can express a particular choice of balls as a table:-

|  | In Selection | Not In Selection | Total
------------- | ------------- |  ------------- | ------------- | 
Grey Selected | 8 | 2 | 10 |
Grey Not Selected  | 4 | 26 | 30 |
Total | 12 | 6 | 40 |



The formula for Fishers exact test is;

$$ p = \frac{\binom{a + b}{a}\binom{c +d}{c}}{\binom{n}{a +c}} = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!n!} $$
with :-

|  | In Selection | Not In Selection | Total
------------- | ------------- |  ------------- | ------------- | 
Grey Selected | a | b | a  +  b |
Grey Not Selected  | c | d | c + d |
Total | a + c | b +d | a + b + c + d (=n) |




or less formally;

*P = (ways of choosing grey balls) X (ways of non-grey balls amongst subset) / ways of choosing subset*


Fortunately, we have software to calculate these quantities!

## Software for conducting a over-representation test (goseq)

We will be using [`goseq`](https://www.bioconductor.org/packages/release/bioc/html/goseq.html), which is a software package available through [Bioconductor](www.bioconductor.org/). However, rather having to write R code, we will be using the package through our institute's Galaxy server.

This package has been *specifically-developed* for use with RNA-seq data. Plenty of methods have been applied to microarray data, but the assumptions might not hold for RNA-seq. For instance, `goseq` will adjust for the length of a gene.

## Preparing the data for an over-representation test

We can start with the list that contains results for *all* genes; Let's take `t47d_Treatment_DEA_Prog-vs-Control_all.csv`

```{r}
deTable <- read.csv("t47d_Treatment_DEA_Prog-vs-Control_all.csv")
deTable
```

The goal is to obtain a table with two columns; the first containing the gene identifier, and the second being `TRUE` or `FALSE` indicating whether the gene is differentially-expressed at a given cut-off (e.g. `0.05`). 

There are several ways of doing this. If you wish to see the R code (using `dplyr`), click the 'Code' button on the right.

```{r message=FALSE}
library(dplyr)
deTable <- read.csv("t47d_Treatment_DEA_Prog-vs-Control_all.csv")
mutate(deTable, DE = padj < 0.05) %>% 
  mutate(DE = ifelse(is.na(DE),FALSE,DE)) %>% 
  select(X, DE) -> filteredTable
  write.table(filteredTable,"de-table-for-goseq.txt",quote=FALSE,row.names=FALSE)
filteredTable
```

Manipulating the file can also be done in Galaxy.

## Importing the Gene list into Galaxy

### Import the csv file

***Get Data*** -> ***Upload Data***

![](images/galaxy-1.png)

### Convert to tabular form

***Text Manipulation*** -> ***Convert delimiters to TAB***

Select *Commas* from drop-down menu

![](images/galaxy-2.png)

### Remove header line

***Text Manipulation*** -> ***Select last lines from a dataset***

Make note of how many lines in file, *N* and remove *N-1* last lines with this tool
![](images/galaxy-3.png)

### Retain columns of interest

Put *c1,c2,c3,c4,c5,c6,c7,c12* to retain columns 1,2,3,4,5,6,7 and 12

***Text Manipulation*** -> ***Cut columns from a table***

![](images/galaxy-4.png)

### The cleaned dataset

The output from the previous step will be referred to as our *cleaned dataset*

![](images/galaxy-5.png)


## Performing a Gene Set Analysis in Galaxy

From our cleaned table in the previous steps, we need to compute if each gene is DE at 0.05 significance level

### Adding an extra column

***Text Manipulation*** -> ***Compute an expression on every row***

Use the expression `c7<0.05` to test if the adjusted p-value is less than 0.05

![](images/galaxy-6.png)

Which should give something like this

![](images/galaxy-7.png)

### Extracting the columns needed for goseq

***Text Manipulation*** -> ***Cut columns from a table***
Choose `c1,c9` as the columns to cut

![](images/galaxy-8.png)

Which should give the following

![](images/galaxy-9.png)

### Running goseq

***NGS:RNA-seq** -> goseq

Select Genome version *hg19* in this case. Could also specify which type of identifiers can be found in the first column. *Ensembl* ID is correct in this case.

![](images/galaxy-10.png)

### Exercise:

The differentially-expressed genes that we identified could either by up- or down-regulated. Sometimes we might want to look for over-represented genes in the up- or down-regulated genes only

- Let's suppose we want to analyse the up-regulated genes only
- Consider what changes you would need to make to the workflow and perform the analysis

## Theory Part II: Threshold-free

For these tests, we don't have to specify a statistical threshold and use the test statistics from *all* genes as the input to the test. The popular *Gene Set Enrichment Analysis (GSEA)* uses this approach. These tests can be used to detect differential expression for a group of genes, even when the effects are too small or there is too little data to detect the genes individually.

![*Subramanian et al, PNAS 2005*](images/GSEA.png)

- Rank all your genes according to test statistic or fold-change
- For a particular gene set of interest, look down the gene list and compute a score
    + Go up one unit on the y-axis if the gene you encounter is contained in your gene list
    + Go down one unit if the gene is not contained in the gene list
    + Plot the score against position in gene list
    + If there is a peak at the left or right, then your set is enriched.

Let's say that we have 10 genes "of interest" and can look-up their statistics from the table
```{r echo=FALSE}
library(limma)
deTable <- read.csv("t47d_Treatment_DEA_Prog-vs-Control_all.csv")
stats <- deTable$stat
statsOrder <- order(stats,decreasing = TRUE)
myIndex <-  c(sample(statsOrder[1:500],5),sample(501:length(stats),5))
data.frame(Gene = deTable$symbol[myIndex], Statistic = stats[myIndex])
```


The Broad institute provides [a version of GSEA](http://software.broadinstitute.org/gsea/index.jsp) that can be run via a java application. Generically, the input to GSEA is matrix of expression values / counts and a file describing which covariate each sample belongs to. However, whether or not the method can be applied directly to RNA-seq data is under debate. According to the [GSEA faq](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/FAQ#Can_I_use_GSEA_to_analyze_SNP.2C_SAGE.2C_ChIP-Seq_or_RNA-Seq_data.3F)

> The GSEA team has yet to determine whether any of these ranking statistics, originally selected for their effectiveness when used with expression data derived from DNA Microarray experiments, are appropriate for use with expression data derived from RNA-seq experiments. We hopefully will be able to devote some time to investigating this, but in the mean time, we are recommending use of the GSEAPreranked tool for conducting gene set enrichment analysis of data derived from RNA-seq experiments.

The GSEAPreranked tool mentioned is available through the Broad insitute's [GenePattern](https://genepattern.broadinstitute.org) analysis suite. However, we don't want to introduce yet another analysis suite, so we will focus on how to perform this analysis in Galaxy.



## Preparing the gene set for a GSEA analysis


```{r}
library(dplyr)
deTable <- read.csv("deseq2_results/t47d_Treatment_DEA_Prog-vs-Control_all.csv")
deTable %>% arrange(desc(stat)) %>% 
  dplyr:::select(symbol, stat) %>% 
  filter(!is.na(symbol)) -> orderedTable
orderedTable
```


### Ranking the genes

From the *cleaned* table created above, sort on column 5 (the test statistic)

***Filter and Sort*** -> ***Sort data in ascending or descending order***

![](images/galaxy-10.png)

Giving the output:-

![](images/galaxy-11.png)

### Cutting the columns

Extract columns 8 (gene symbol) followed by test statistic; `c8,c5`

***Text Manipulation*** -> ***Cut columns from a table***

![](images/galaxy-12.png)


## Running a GSEA analysis

### Using the fGSEA tool

`fGSEA` is a Bioconductor package that implements the pre-ranked GSEA analysis. It also happens to be a faster implementation, as described in their [pre-print](http://biorxiv.org/content/early/2016/06/20/060012)

GSEA uses a set of pre-defined gene sets in their analysis which come under the following categories:-

- H hallmark gene sets
- C1 positional gene sets
- C2 curated gene sets
- C3 motif gene sets
- C4 computational gene sets
- C5 GO gene sets
- C6 oncogenic signatures
- C7 immunologic signatures

![](images/galaxy-13.png)

Gives all results ranked according to..

- Can we identify gene sets that are significant at 0.05?
- What genes are over- and under-represented?
- Can you run the tool on another dataset?